Median
Learn how to use the median to improve the accuracy of anomaly detection.
In statistics, the mean is not considered a robust measure because extreme values influence it. For our use case this is a problem: the measure we use to identify extreme values is itself affected by the values we are trying to identify.
For example, at the beginning of the article, we used this series of values (listed here in sorted order): 2, 2, 3, 3, 3, 4, 5, 5, 12.
The mean of this series is 4.33, and we detected 12 as an anomaly.
If the 12 were a 120, the mean of the series would have been 16.33. Hence, our “reasonable” value is heavily affected by the values it is supposed to identify.
The median is considered a more robust measure. The median of a series is the value that splits it in half: half of the values are greater than it, and half are less than it.
To calculate the median in PostgreSQL, we use the function percentile_disc with a fraction of 0.5. In the series above, the median is 3. If we sort the list and cut it in the middle, the median becomes clearer. It is the fifth of the nine values, at the end of the first row:
2, 2, 3, 3, 3
4, 5, 5, 12
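As a minimal sketch of such a query, assume the nine values are supplied inline through a CTE named series with a single column n. Both names are placeholders chosen for this illustration and may differ from the table used earlier in the article.

```sql
-- Hypothetical inline data: the CTE name "series" and the column "n"
-- are assumptions made for this sketch.
WITH series (n) AS (
    VALUES (2), (2), (3), (3), (3), (4), (5), (5), (12)
)
-- A fraction of 0.5 asks for the value at the middle of the ordered set.
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY n) AS median
FROM series;
-- median
-- ------
--      3
```

percentile_disc always returns a value that actually appears in the series. Its sibling percentile_cont interpolates between the two middle values when the number of rows is even.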
If we change the value of 12 to 120, the median will not be affected at all:
2, 2, 3, 3, 3
4, 5, 5, 120
This is why the median is considered more robust than the mean.
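To see both measures side by side, the following sketch computes the mean and the median for the original series and for the variant in which 12 is replaced by 120. The CTE name samples and the label column are assumptions introduced only for this comparison.

```sql
-- Hypothetical inline data: "samples", "label" and "n" are illustrative names.
WITH samples (label, n) AS (
    SELECT 'with 12', t.n
    FROM unnest(ARRAY[2, 2, 3, 3, 3, 4, 5, 5, 12]) AS t(n)
    UNION ALL
    SELECT 'with 120', t.n
    FROM unnest(ARRAY[2, 2, 3, 3, 3, 4, 5, 5, 120]) AS t(n)
)
SELECT
    label,
    round(avg(n), 2) AS mean,                                  -- pulled up by the extreme value
    percentile_disc(0.5) WITHIN GROUP (ORDER BY n) AS median   -- unchanged by it
FROM samples
GROUP BY label
ORDER BY label;
--  label    | mean  | median
-- ----------+-------+--------
--  with 12  |  4.33 |      3
--  with 120 | 16.33 |      3
```

The mean moves from 4.33 to 16.33 when a single value changes, while the median stays at 3 in both cases.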